Serveur d'exploration MERS

Attention, ce site est en cours de développement !
Attention, site généré par des moyens informatiques à partir de corpus bruts.
Les informations ne sont donc pas validées.

DSK: k-mer counting with very low memory usage.

Identifieur interne : 000132 ( France/Analysis ); précédent : 000131; suivant : 000133

DSK: k-mer counting with very low memory usage.

Auteurs : Guillaume Rizk [France] ; Dominique Lavenier ; Rayan Chikhi

Source :

RBID : pubmed:23325618

Descripteurs français

English descriptors

Abstract

Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.

DOI: 10.1093/bioinformatics/btt020
PubMed: 23325618


Affiliations:


Links toward previous steps (curation, corpus...)


Links to Exploration step

pubmed:23325618

Le document en format XML

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en">DSK: k-mer counting with very low memory usage.</title>
<author>
<name sortKey="Rizk, Guillaume" sort="Rizk, Guillaume" uniqKey="Rizk G" first="Guillaume" last="Rizk">Guillaume Rizk</name>
<affiliation wicri:level="3">
<nlm:affiliation>Algorizk, 75013 Paris, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>Algorizk, 75013 Paris</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Lavenier, Dominique" sort="Lavenier, Dominique" uniqKey="Lavenier D" first="Dominique" last="Lavenier">Dominique Lavenier</name>
</author>
<author>
<name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">PubMed</idno>
<date when="2013">2013</date>
<idno type="RBID">pubmed:23325618</idno>
<idno type="pmid">23325618</idno>
<idno type="doi">10.1093/bioinformatics/btt020</idno>
<idno type="wicri:Area/PubMed/Corpus">001D24</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Corpus" wicri:corpus="PubMed">001D24</idno>
<idno type="wicri:Area/PubMed/Curation">001D24</idno>
<idno type="wicri:explorRef" wicri:stream="PubMed" wicri:step="Curation">001D24</idno>
<idno type="wicri:Area/PubMed/Checkpoint">001B72</idno>
<idno type="wicri:explorRef" wicri:stream="Checkpoint" wicri:step="PubMed">001B72</idno>
<idno type="wicri:Area/Ncbi/Merge">000A17</idno>
<idno type="wicri:Area/Ncbi/Curation">000A17</idno>
<idno type="wicri:Area/Ncbi/Checkpoint">000A17</idno>
<idno type="wicri:Area/Main/Merge">002043</idno>
<idno type="wicri:Area/Main/Curation">002022</idno>
<idno type="wicri:Area/Main/Exploration">002022</idno>
<idno type="wicri:Area/France/Extraction">000132</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en">DSK: k-mer counting with very low memory usage.</title>
<author>
<name sortKey="Rizk, Guillaume" sort="Rizk, Guillaume" uniqKey="Rizk G" first="Guillaume" last="Rizk">Guillaume Rizk</name>
<affiliation wicri:level="3">
<nlm:affiliation>Algorizk, 75013 Paris, France.</nlm:affiliation>
<country xml:lang="fr">France</country>
<wicri:regionArea>Algorizk, 75013 Paris</wicri:regionArea>
<placeName>
<region type="region" nuts="2">Île-de-France</region>
<settlement type="city">Paris</settlement>
</placeName>
</affiliation>
</author>
<author>
<name sortKey="Lavenier, Dominique" sort="Lavenier, Dominique" uniqKey="Lavenier D" first="Dominique" last="Lavenier">Dominique Lavenier</name>
</author>
<author>
<name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
</author>
</analytic>
<series>
<title level="j">Bioinformatics (Oxford, England)</title>
<idno type="eISSN">1367-4811</idno>
<imprint>
<date when="2013" type="published">2013</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Algorithms</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Sequence Analysis, DNA (methods)</term>
<term>Sequence Analysis, RNA (methods)</term>
<term>Software</term>
</keywords>
<keywords scheme="KwdFr" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN ()</term>
<term>Analyse de séquence d'ARN ()</term>
<term>Génome humain</term>
<term>Humains</term>
<term>Logiciel</term>
</keywords>
<keywords scheme="MESH" qualifier="methods" xml:lang="en">
<term>Sequence Analysis, DNA</term>
<term>Sequence Analysis, RNA</term>
</keywords>
<keywords scheme="MESH" xml:lang="en">
<term>Algorithms</term>
<term>Genome, Human</term>
<term>Humans</term>
<term>Software</term>
</keywords>
<keywords scheme="MESH" xml:lang="fr">
<term>Algorithmes</term>
<term>Analyse de séquence d'ADN</term>
<term>Analyse de séquence d'ARN</term>
<term>Génome humain</term>
<term>Humains</term>
<term>Logiciel</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Counting all the k-mers (substrings of length k) in DNA/RNA sequencing reads is the preliminary step of many bioinformatics applications. However, state of the art k-mer counting methods require that a large data structure resides in memory. Such structure typically grows with the number of distinct k-mers to count. We present a new streaming algorithm for k-mer counting, called DSK (disk streaming of k-mers), which only requires a fixed user-defined amount of memory and disk space. This approach realizes a memory, time and disk trade-off. The multi-set of all k-mers present in the reads is partitioned, and partitions are saved to disk. Then, each partition is separately loaded in memory in a temporary hash table. The k-mer counts are returned by traversing each hash table. Low-abundance k-mers are optionally filtered. DSK is the first approach that is able to count all the 27-mers of a human genome dataset using only 4.0 GB of memory and moderate disk space (160 GB), in 17.9 h. DSK can replace a popular k-mer counting software (Jellyfish) on small-memory servers.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>France</li>
</country>
<region>
<li>Île-de-France</li>
</region>
<settlement>
<li>Paris</li>
</settlement>
</list>
<tree>
<noCountry>
<name sortKey="Chikhi, Rayan" sort="Chikhi, Rayan" uniqKey="Chikhi R" first="Rayan" last="Chikhi">Rayan Chikhi</name>
<name sortKey="Lavenier, Dominique" sort="Lavenier, Dominique" uniqKey="Lavenier D" first="Dominique" last="Lavenier">Dominique Lavenier</name>
</noCountry>
<country name="France">
<region name="Île-de-France">
<name sortKey="Rizk, Guillaume" sort="Rizk, Guillaume" uniqKey="Rizk G" first="Guillaume" last="Rizk">Guillaume Rizk</name>
</region>
</country>
</tree>
</affiliations>
</record>

Pour manipuler ce document sous Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Sante/explor/MersV1/Data/France/Analysis
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000132 | SxmlIndent | more

Ou

HfdSelect -h $EXPLOR_AREA/Data/France/Analysis/biblio.hfd -nk 000132 | SxmlIndent | more

Pour mettre un lien sur cette page dans le réseau Wicri

{{Explor lien
   |wiki=    Sante
   |area=    MersV1
   |flux=    France
   |étape=   Analysis
   |type=    RBID
   |clé=     pubmed:23325618
   |texte=   DSK: k-mer counting with very low memory usage.
}}

Pour générer des pages wiki

HfdIndexSelect -h $EXPLOR_AREA/Data/France/Analysis/RBID.i   -Sk "pubmed:23325618" \
       | HfdSelect -Kh $EXPLOR_AREA/Data/France/Analysis/biblio.hfd   \
       | NlmPubMed2Wicri -a MersV1 

Wicri

This area was generated with Dilib version V0.6.33.
Data generation: Mon Apr 20 23:26:43 2020. Site generation: Sat Mar 27 09:06:09 2021